Combination of multilingual and semi-supervised training for under-resourced languages
Abstract
Multilingual training of neural networks for ASR is widely studied these days. It has been shown that languages with little training data can benefit greatly from multilingual training resources. The use of unlabeled data for neural network training in a semi-supervised manner has also been shown to improve ASR system performance. Here, we present a combination of both methods. First, multilingual training is performed to obtain an ASR system that automatically transcribes the unlabeled data. Then, the automatically transcribed data are added to the training set. Two neural networks are trained, one from random initialization and one adapted from the multilingual network, to evaluate the effect of multilingual training in the presence of a larger amount of training data. Furthermore, the CMLLR transform is applied in the middle of the stacked Bottle-Neck neural network structure. As CMLLR rotates the features to better fit the given model, we evaluated whether it is better to adapt the existing NN on the CMLLR features or to train it from random initialization. The last step in our training procedure is fine-tuning on the original data.
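The recipe above (a seed model trained first, automatic transcription of unlabeled data, retraining on the enlarged set, and final fine-tuning on the original data) can be sketched on toy data. This is a minimal illustration only: a nearest-centroid classifier stands in for the Bottle-Neck neural network, and all data and function names are invented.

```python
from collections import defaultdict

def train_centroids(examples):
    """Toy 'model': per-class mean of 2-D feature vectors."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for x, label in examples:
        sums[label][0] += x[0]
        sums[label][1] += x[1]
        counts[label] += 1
    return {lab: (s[0] / counts[lab], s[1] / counts[lab])
            for lab, s in sums.items()}

def predict(model, x):
    """Label of the nearest class centroid."""
    return min(model, key=lambda lab: (x[0] - model[lab][0]) ** 2
                                      + (x[1] - model[lab][1]) ** 2)

labeled = [((0.0, 0.0), "a"), ((0.2, 0.1), "a"),
           ((1.0, 1.0), "b"), ((0.9, 1.1), "b")]
unlabeled = [(0.1, 0.2), (1.1, 0.9), (0.05, 0.0)]

seed = train_centroids(labeled)                      # 1) seed model on labeled data
pseudo = [(x, predict(seed, x)) for x in unlabeled]  # 2) auto-transcribe unlabeled data
enlarged = train_centroids(labeled + pseudo)         # 3) retrain on labeled + pseudo-labeled

def fine_tune(model, ref, alpha=0.5):
    """4) Fine-tuning stand-in: pull the model back toward the
    labeled-only model (the paper instead continues NN training
    on the original transcribed data)."""
    return {lab: (alpha * ref[lab][0] + (1 - alpha) * model[lab][0],
                  alpha * ref[lab][1] + (1 - alpha) * model[lab][1])
            for lab in model}

final = fine_tune(enlarged, seed)
```

The point of step 3 is that the pseudo-labeled examples shift the class statistics toward the unlabeled distribution, while step 4 anchors the model back to the trusted transcriptions.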
Similar resources
Indonesian Dependency Treebank: Annotation and Parsing
We introduce and describe ongoing work on our Indonesian dependency treebank. We describe the characteristics of the source data as well as our annotation guidelines for creating the dependency structures. We report initial results from the Indonesian dependency treebank. We also show ensemble dependency parsing and self-training approaches applicable to under-resourced...
Huge Automatically Extracted Training Sets for Multilingual Word Sense Disambiguation
We release to the community six large-scale sense-annotated datasets in multiple languages to pave the way for supervised multilingual Word Sense Disambiguation. Our datasets cover all the nouns in the English WordNet and their translations in other languages, for a total of millions of sense-tagged sentences. Experiments prove that these corpora can be effectively used as training sets for supe...
Exploring Confidence-based Self-training for Multilingual Dependency Parsing in an Under-Resourced Language Scenario
This paper presents a novel self-training approach that we use to explore a scenario typical of under-resourced languages. We apply self-training to small multilingual dependency corpora of nine languages. Our approach employs a confidence-based method to gain additional training data from large unlabeled datasets. The method has proven effective for five of the nine...
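The confidence-based filtering this snippet describes can be illustrated with a toy loop: pseudo-label unlabeled items with the current model, but keep only those whose confidence exceeds a threshold. A 1-NN distance ratio stands in for a parser's confidence score here; the data, threshold, and names are all invented for illustration.

```python
import math

def nearest(labeled, x):
    """Return (best_label, best_dist, dist_to_runner_up_class) under 1-NN."""
    dists = sorted((math.dist(x, xi), yi) for xi, yi in labeled)
    best_dist, best_label = dists[0][0], dists[0][1]
    second = next((d for d, y in dists if y != best_label), float("inf"))
    return best_label, best_dist, second

def confident_pseudo_labels(labeled, unlabeled, threshold=2.0):
    """Keep a pseudo-label only if the runner-up class is `threshold`
    times farther away than the predicted class."""
    kept = []
    for x in unlabeled:
        label, d1, d2 = nearest(labeled, x)
        if d2 >= threshold * max(d1, 1e-9):
            kept.append((x, label))
    return kept

seed = [((0.0, 0.0), "NOUN"), ((1.0, 1.0), "VERB")]
pool = [(0.1, 0.0), (0.5, 0.5), (0.9, 1.0)]
extra = confident_pseudo_labels(seed, pool)
# (0.5, 0.5) is equidistant from both classes and is filtered out.
```

Only the two unambiguous points survive the filter and would be added to the training set for the next self-training round.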
Sentiment Classification in Under-Resourced Languages Using Graph-Based Semi-Supervised Learning Methods
In sentiment classification, conventional supervised approaches rely heavily on large amounts of linguistic resources, which are costly to obtain for under-resourced languages. To overcome this resource-scarcity problem, several methods exist that exploit graph-based semi-supervised learning (SSL). However, fundamental issues such as controlling label propagation and choosing the initial seeds...
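The graph-based SSL family this snippet refers to can be sketched with a toy label-propagation loop: sentiment scores spread from a few seed nodes over a similarity graph, and the seed labels are clamped after every sweep. The graph, seeds, and names here are invented for illustration.

```python
def propagate(adj, seeds, n_nodes, iters=50):
    """adj: {node: [neighbors]}; seeds: {node: +1.0 or -1.0 sentiment}.
    Each sweep replaces a node's score with its neighborhood average,
    then re-clamps the labeled seed nodes."""
    score = {i: seeds.get(i, 0.0) for i in range(n_nodes)}
    for _ in range(iters):
        new = {}
        for i in range(n_nodes):
            nbrs = adj.get(i, [])
            new[i] = sum(score[j] for j in nbrs) / len(nbrs) if nbrs else score[i]
        new.update(seeds)  # clamp the labeled seeds
        score = new
    return score

# A 5-node chain: 0 (positive seed) - 1 - 2 - 3 - 4 (negative seed).
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = propagate(chain, seeds={0: 1.0, 4: -1.0}, n_nodes=5)
# Node 1 ends up positive, node 3 negative, node 2 balanced at zero.
```

The issues the snippet names map directly onto this sketch: "controlling label propagation" is the choice of graph weights and clamping schedule, and "choosing the initial seeds" is the choice of the `seeds` dictionary.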
Very Low Resource Radio Browsing for Agile Developmental and Humanitarian Monitoring
We present a radio browsing system developed on a very small corpus of annotated speech by using semi-supervised training of multilingual DNN/HMM acoustic models. The system is intended to support relief and developmental programmes run by the United Nations (UN) in parts of Africa where the spoken languages are extremely under-resourced. We assume the availability of 12 minutes of annotated speec...